Our Global Execution Services (GES) offer Algorithmic Execution services to a wide variety of clients.
In this space, one of the key differentiators among Algo Trading providers is how efficiently they can execute the strategies a client has signed up for.
Popular algo strategies are Percentage of Volume, Pegged, VWAP, TWAP, Implementation shortfall, Target close etc.
In Equity markets, two factors that play a significant part in executing such strategies successfully are Daily Volume (total volume exchanged in a day) and Close Volume (volume exchanged during the Auction Process of an exchange at the end of each day).
Indeed, sufficient liquidity is obviously critical in the actual execution of any strategy (no volume, no trades, no clients!) and being able to predict volume is of great importance to an algo decision making process.
In this project we will try to answer the following question:
Can we predict future close volumes from historical volumes and other factors?
To answer this question we will try several regression models and identify the ones best suited for the task.
Our data will be daily volumes of FTSE100 constituents between 2010-01-04 and 2019-02-26.
Features
In addition to Volume derived features we will also introduce the notion of Special Days.
A Special Day will differ from a Regular Day in that it is the effective date of one of the following industry events:
Monthly Last Business Day: Simply the last day of the trading month.
MSCI Europe Quarterly Review: The effective day of the latest MSCI Europe quarterly index review.
MSCI Europe Semi-Annual Review: The effective day of the latest MSCI Europe semi-annual index review.
Quarterly Expiry: The quarterly option expiry day.
Influence of Special Days
Intuitively these Special Days are likely to influence volumes on the following basis:
Monthly Last Business Day: Many market participants have month-end operational processes which can lead to increased market activity.
MSCI Europe Quarterly Review and MSCI Europe Semi-Annual Review: MSCI is one of the most influential equity index providers. Their indexes are reviewed quarterly and semi-annually. Whenever a review changes an index, a significant amount of market activity is induced on the effective date by market participants who depend on those indexes (ETFs, funds, ...).
Quarterly Expiry: When an option expires and is exercised, the underlying equity has to be delivered, and extra market activity can be expected during the expiry cycle.
Data source
The data for this project has been kindly provided by Khalil Dayri from our GES team.
We will therefore not describe how it was put together but for reference:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))
import numpy as np
import pandas as pd
import pandas_profiling
import seaborn as sns
import matplotlib.pyplot as plt
from bokeh.palettes import Spectral4
from bokeh.plotting import figure, output_file, show
from bokeh.models import CrosshairTool, HoverTool, ZoomInTool, ZoomOutTool, PanTool, BoxZoomTool, NumeralTickFormatter, CustomJS, DataRange1d
from bokeh.models.sources import ColumnDataSource
from bokeh.io import output_notebook
from bokeh.palettes import Category20
import itertools
# import sklearn auxiliary packages
import warnings
warnings.filterwarnings('ignore')
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
#import sklearn regression models
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, Lars, OrthogonalMatchingPursuit, BayesianRidge, ARDRegression, SGDRegressor, PassiveAggressiveRegressor, RANSACRegressor, TheilSenRegressor, HuberRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.neighbors import KNeighborsRegressor, RadiusNeighborsRegressor
from sklearn.svm import SVR, NuSVR, LinearSVR
from sklearn.kernel_ridge import KernelRidge
from sklearn.isotonic import IsotonicRegression
import lightgbm as lgb
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import statsmodels.api as sm
from statsmodels.tsa.statespace.sarimax import SARIMAX
from pandas.plotting import lag_plot
from itertools import product
from scipy import stats
import time
output_notebook()
%matplotlib inline
# let's load our dataset
df = pd.read_csv('./datasets/UKX_Volume_Close.csv')
df.shape
# let's have a peek at the data
df.head()
# let's check the data types
df.dtypes
# let's rename the column names to be more consistent in our naming convention
old_names = ['Ticker', 'Date', 'CLOSE_VOLUME', 'DailyVolume', 'specialDay',
'close_pct', 'VWAP_VOLUME_uts_mean_20d', 'VWAP_VOLUME_uts_median_20d',
'close_pct_uts_mean_20d', 'close_pct_uts_median_20d', 'DateType',
'weekday']
new_names = ['ticker', 'date', 'close_volume', 'daily_volume', 'special_day',
'close_pct', 'daily_volume_mean_20d', 'daily_volume_median_20d',
'close_pct_mean_20d', 'close_pct_median_20d', 'date_type',
'weekday']
df.rename(columns=dict(zip(old_names, new_names)), inplace=True)
df.head()
ticker uniquely identifies each stock constituent of the FTSE100 index.
date is the business date to which the sample relates.
close_volume is the total volume traded during the Auction Closing process.
daily_volume is the total volume traded in one day.
special_day is a bool denoting whether that date is a Special Day or not.
close_pct is the ratio close_volume/daily_volume expressed in percentage terms.
daily_volume_mean_20d is the 20-period moving average of daily_volume.
daily_volume_median_20d is the 20-period moving median of daily_volume.
close_pct_mean_20d is the 20-period moving average of close_pct.
close_pct_median_20d is the 20-period moving median of close_pct.
date_type qualifies the sample date, taking one of the following values: ['regular', 'Monthly Last Business day',
'MSCIEurope Quarterly Review', 'Quarterly Expiry',
'MSCIEurope Semi-Annual Review'].
weekday is the day of the week for this sample date.
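These derived columns arrive precomputed in the dataset, but for intuition here is a sketch of how close_pct and the 20-period rolling statistics could be rebuilt from the raw volume columns. The numbers below are synthetic and purely illustrative:

```python
import numpy as np
import pandas as pd

# toy frame: 30 days of volumes for one ticker (synthetic values)
rng = np.random.default_rng(0)
toy = pd.DataFrame({'daily_volume': rng.integers(1_000_000, 5_000_000, size=30).astype(float)})
toy['close_volume'] = toy['daily_volume'] * rng.uniform(0.05, 0.25, size=30)

# close_pct: auction close volume as a percentage of the day's total
toy['close_pct'] = 100 * toy['close_volume'] / toy['daily_volume']
# 20-period moving statistics; the first 19 rows are NaN by construction
toy['daily_volume_mean_20d'] = toy['daily_volume'].rolling(20).mean()
toy['daily_volume_median_20d'] = toy['daily_volume'].rolling(20).median()
```

The rolling windows explain the null values we will encounter below: every 20-period statistic needs 19 prior observations before it can yield a value.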
profile = pandas_profiling.ProfileReport(df)
profile.to_file(outputfile="output.html")
display(profile)
date is currently of type string/categorical.
weekday is only populated for Special Days.
close_volume and daily_volume are highly skewed.
daily_volume_mean_20d, daily_volume_median_20d, close_pct_mean_20d and close_pct_median_20d have null values.
daily_volume_mean_20d and daily_volume_median_20d are highly correlated.
We will keep daily_volume_mean_20d in our model and drop daily_volume_median_20d.
close_pct_mean_20d and close_pct_median_20d are highly correlated.
We will keep close_pct_mean_20d in our model and drop close_pct_median_20d.
ticker (which identifies a single stock) takes 164 values.
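The redundant feature pairs flagged by the profiling report can also be detected programmatically from the correlation matrix. A minimal sketch on synthetic data (the values are made up; only the column names mirror the dataset):

```python
import numpy as np
import pandas as pd

# synthetic stand-ins: the two rolling statistics track the same underlying series
rng = np.random.default_rng(1)
base = rng.normal(size=200).cumsum()
feats = pd.DataFrame({
    'daily_volume_mean_20d': base + rng.normal(scale=0.01, size=200),
    'daily_volume_median_20d': base + rng.normal(scale=0.01, size=200),
    'close_pct_mean_20d': rng.normal(size=200),
})

# keep only the upper triangle so each pair is reported once
corr = feats.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [(a, b) for a in upper.index for b in upper.columns if upper.loc[a, b] > 0.95]
```

Any pair appearing in `redundant` is a candidate for dropping one of its members, which is exactly what we do below with the median columns.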
# let's convert `date` to datetime
before = str(df.dtypes['date'])
df['date'] = pd.to_datetime(df['date'])
after = str(df.dtypes['date'])
print('Date type before: {}, after: {}'.format(before, after))
# let's convert 'special_day' to int as some issues were encountered in certain seaborn plots when using bool values!
before = str(df.dtypes['special_day'])
df['special_day'] = df['special_day'].astype(int)
after = str(df.dtypes['special_day'])
print('Special_day type before: {}, after: {}'.format(before, after))
# let's remove missing weekdays by repopulating the whole column
before = df[df.columns].isnull().any(axis=0)['weekday']
df['weekday'] = df['date'].dt.day_name()
after = df[df.columns].isnull().any(axis=0)['weekday']
print('Any missing values for weekday before: {}, after: {}'.format(before, after))
# let's print column names with null values
print('Columns with null values: {}'.format(list(df.columns[np.where(df[df.columns].isnull().any(axis=0))])))
impact_sample_count = df[df[df.columns[:-1]].isnull().any(axis=1)].shape[0]
print('There are {} samples impacted ({:.2f}% of all samples)'.format(impact_sample_count, 100*impact_sample_count/df.shape[0]))
# Let's look at some of the samples impacted by those null value
df[df[df.columns[:-1]].isnull().any(axis=1)].head()
Clearly these null values are due to the rolling mean/median columns requiring previous data to yield a value.
We will discard them as the number of impacted samples is negligible (0.05%).
# dropping samples with null values
before = df.shape[0]
df = df.dropna()
after = df.shape[0]
print('df sample count before: {}, after: {}'.format(before, after))
# let's drop daily_volume_median_20d and close_pct_median_20d
df.drop(columns=['daily_volume_median_20d', 'close_pct_median_20d'], inplace=True)
df.head()
# get list of unique tickers
tickers = df.ticker.unique()
# create list of new tickers keeping only the first word (the actual ticker)
new_tickers = np.array([x.split(' ')[0] for x in tickers])
# replace each ticker with its new value using the map function which takes as a parameter a dict of key:value defined as 'old_value': 'new_value'
df.ticker = df.ticker.map(dict(zip(tickers, new_tickers)))
print(df.ticker.unique())
# let's take a closer look at a specific ticker, say 'ABF' and let's make sure that we have one record per date
# we first group all ABF samples per date and count the number of samples in each group
abf = df.loc[df.ticker=='ABF'][['date', 'ticker']].groupby(['date']).count().rename(columns={'ticker': 'ct'})
# have we got any group with more than one sample?
abf.loc[abf.ct>1].shape
Well, that's unexpected: in some instances (33 dates for ABF) we have more than one sample for a given ticker and date!
# let's check for ABF which dates are represented more than once
df.loc[df.ticker=='ABF'][df.date.isin(abf.loc[abf.ct>1].index)]
Ok so apparently each Special Day type is represented with one individual sample, meaning that if several Special Days occur on the same day we will have duplicated samples for that day.
Furthermore, none of these duplicated dates carries a regular-day sample: on a Special Day there is no regular-day record at all. As a result, the time series is broken if we only look at regular days, and contains duplicates if we look at all samples.
We need a better way to represent the Special Days information, one which doesn't involve messing with our time series continuity.
What we propose here is to rebuild a new dataframe with one sample per tuple (ticker, date) and encode the date_type information as binary 0/1 flags for each of the values it currently takes.
Let's go through the details!
# first let's take all unique day samples from df whilst dropping column 'date_type' as it will be reengineered.
# to do this we take all unique tuples (ticker, date, close_volume, daily_volume, special_day, close_pct, daily_volume_mean_20d, close_pct_mean_20d, weekday)
# this will give us all unique (ticker, date) observations, most of which are regular days and some of which are special days, but without duplicates if several special days occurred on the same date.
df_new = df[list(df.columns[df.columns!='date_type'])].drop_duplicates().copy()
df_new.shape
df_new.head(2)
# second let's create new columns to hold the 0/1 flag encoding of the original 'date_type' column.
df_new['date_type_regular'] = 0
df_new['date_type_monthly_last_business_day'] = 0
df_new['date_type_msci_quarterly_review'] = 0
df_new['date_type_msci_semi_annual_review'] = 0
df_new['date_type_quarterly_expiry'] = 0
# defining a list with all date_types corresponding to a special day
date_types = ['Monthly Last Business day', 'MSCIEurope Quarterly Review', 'MSCIEurope Semi-Annual Review', 'Quarterly Expiry']
# defining a list of target column names for the 0/1 flag encoding of the special days date_types column
date_type_ohe_col_names = ['date_type_monthly_last_business_day', 'date_type_msci_quarterly_review', 'date_type_msci_semi_annual_review', 'date_type_quarterly_expiry']
# looping through all tuple (date_type, date_type_ohe_col_name) to compute the binary flag encoding of each special day date_types
for date_type, date_type_ohe_col_name in zip(date_types, date_type_ohe_col_names):
df_new = pd.merge(df_new, df.loc[(df.date_type==date_type)][['ticker', 'date', 'date_type']], left_on=['ticker', 'date'], right_on=['ticker', 'date'], how='left')
df_new.loc[df_new.date_type==date_type, date_type_ohe_col_name] = 1
df_new.drop(columns=['date_type'], inplace=True)
# now knowing all special days we can compute the binary flag encoding column for regular days (date_type_regular) like so:
df_new.loc[(df_new.date_type_monthly_last_business_day==0) &
(df_new.date_type_msci_quarterly_review==0) &
(df_new.date_type_msci_semi_annual_review==0) &
(df_new.date_type_quarterly_expiry==0), 'date_type_regular'] = 1
df_new.head()
Now that we've re-encoded the original date_type column into the following binary flags: date_type_regular, date_type_monthly_last_business_day, date_type_msci_quarterly_review, date_type_msci_semi_annual_review, date_type_quarterly_expiry, let's make sure we haven't lost any information and kept the original data consistency.
# firstly, let's check that the overall number of samples is still the same
original_sample_count = df[df.special_day==0].shape[0] + df[df.special_day==1][['date', 'ticker']].drop_duplicates().shape[0]
new_sample_count = df_new.shape[0]
print('SAMPLE COUNT CHECK, original: {}, new: {}, result: {}'.format(original_sample_count, new_sample_count, 'PASSED' if original_sample_count==new_sample_count else 'FAILED'))
# secondly, let's check that for each ticker we have the same special days, we'll do this by isolating for each special day type, all tuples (ticker, date) into 2 dataframes: one for original dataset (df)
# and one for the new dataset (df_new), then we'll compare the two dataframes using the pandas function 'equals'.
for date_type, date_type_ohe_col_name in zip(date_types, date_type_ohe_col_names):
original_date_type_df = df[df.date_type==date_type][['ticker', 'date']].reset_index(drop=True)
new_date_type_df = df_new[df_new[date_type_ohe_col_name]==1][['ticker', 'date']].reset_index(drop=True)
result = original_date_type_df.equals(new_date_type_df)
print('DATE TYPE DATA CHECK, for {}, result: {}'.format(date_type, 'PASSED' if result else 'FAILED'))
# lastly, let's check that the special_day counts for (date, ticker) tuples are consistent between original and new datasets and that they match the information encoded in the special days 0/1 flag columns
original_special_day_count = df.loc[df.special_day==1][['date', 'ticker']].drop_duplicates().shape[0]
new_special_day_count = df_new.loc[df_new.special_day==1].shape[0]
new_special_day_count_encoded = df_new.loc[(df_new.date_type_monthly_last_business_day==1) |
(df_new.date_type_msci_quarterly_review==1) |
(df_new.date_type_msci_semi_annual_review==1) |
(df_new.date_type_quarterly_expiry==1)].shape[0]
result = (original_special_day_count == new_special_day_count == new_special_day_count_encoded)
print('SPECIAL DAY COUNT CHECK, original: {}, new: {}, encoded: {}, result: {}'.format(original_special_day_count, new_special_day_count, new_special_day_count_encoded, 'PASSED' if result else 'FAILED'))
The new encoding for the date type feature seems to be a success!
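For reference, pandas can produce the same one-row-per-(ticker, date) encoding more compactly with get_dummies followed by a groupby max, at the cost of column names derived from the raw date_type strings rather than our snake_case convention. A sketch on a toy frame (the tickers and dates below are illustrative):

```python
import pandas as pd

# toy long-form frame: one row per (ticker, date, date_type); 2014-05-30 carries two special days
toy = pd.DataFrame({
    'ticker': ['ABF', 'ABF', 'ABF'],
    'date': pd.to_datetime(['2014-05-29', '2014-05-30', '2014-05-30']),
    'date_type': ['regular', 'MSCIEurope Semi-Annual Review', 'Quarterly Expiry'],
})

# one-hot encode date_type, then collapse duplicate (ticker, date) rows with max()
flags = (pd.get_dummies(toy, columns=['date_type'], prefix='date_type')
           .groupby(['ticker', 'date'], as_index=False).max())
```

The groupby max merges the two 2014-05-30 rows into a single sample carrying both special-day flags, which is the property our hand-rolled merge loop achieves above.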
# last consideration in our data cleaning section: special_day and date_type_regular are simply the opposite of one another so no need to keep them both, let's drop date_type_regular
df_new.drop(columns=['date_type_regular'], inplace=True)
# finally, now that the new dataframe is finalized, let's replace df with it so we can use df in this notebook going forward instead of df_new.
df = df_new
df.head()
# let's add a 20-day moving average for close_volume
df['close_volume_mean_20d'] = 0
tickers = df.ticker.unique()
for ticker in tickers:
df.loc[df.ticker==ticker, 'close_volume_mean_20d'] = df.loc[df.ticker==ticker].close_volume.rolling(20).mean()
# reordering columns and removing NaNs
df = df[['ticker', 'date', 'weekday', 'special_day', 'daily_volume', 'close_volume',
'close_pct', 'daily_volume_mean_20d', 'close_volume_mean_20d', 'close_pct_mean_20d',
'date_type_monthly_last_business_day',
'date_type_msci_quarterly_review', 'date_type_msci_semi_annual_review',
'date_type_quarterly_expiry']].dropna()
df.head()
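The per-ticker loop above works, but the same per-ticker rolling mean can be computed in one pass with groupby/transform, which also guarantees a window never straddles two stocks. A sketch on a toy frame (synthetic tickers and values):

```python
import numpy as np
import pandas as pd

# toy frame: 25 days of close volumes for each of two tickers (synthetic values)
rng = np.random.default_rng(4)
toy = pd.DataFrame({
    'ticker': ['ABF'] * 25 + ['AZN'] * 25,
    'close_volume': rng.integers(100_000, 900_000, size=50).astype(float),
})

# the rolling window restarts for each ticker group
toy['close_volume_mean_20d'] = (toy.groupby('ticker')['close_volume']
                                   .transform(lambda s: s.rolling(20).mean()))
```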
# plotting a correlation heatmap
plt.rcParams['figure.figsize'] = (16, 10)
sns.set(font_scale=1.2)
sns.heatmap(df.corr(), annot = True)
plt.title('Correlation heatmap of the dataset', fontsize = 30)
plt.show()
# let's visualise our data as a pairplot (excluding the 0/1 flag columns)
sns.pairplot(df[df.columns[3:10]])
From the previous two plots we can confirm the following insights (validating our conclusions from the results obtained with pandas_profiling):
close_volume, daily_volume and daily_volume_mean_20d are highly skewed.
special_day is not well balanced, the 'regular' class having a much higher frequency than the 'special_day' class.
Therefore:
We will log-transform close_volume, daily_volume and daily_volume_mean_20d and inspect the impact on our models.
We will keep in mind that special_day is unbalanced and avoid models known to be sensitive to unbalanced data (e.g. decision trees).
Regarding close volumes, it seems that they take a much wider range of values on a special day (special_day==1).
Let's drill further in the relationship between close_volume and special_day.
To complete this task we will consider average daily volumes across all tickers.
def compute_average_daily_volume_all_tickers():
# let's compute a daily average of our volumes across all tickers
df_ = df.groupby(['date']).mean().reset_index()
# recomputing close_pct and close_pct_mean_20d for average daily volumes
df_['close_pct'] = 100 * df_.close_volume / df_.daily_volume
df_['close_pct_mean_20d'] = 100 * df_.close_volume_mean_20d / df_.daily_volume_mean_20d
# let's add a numerical features for each component of the date field
df_['weekday'] = df_.date.dt.weekday
df_['day'] = df_.date.dt.day
df_['week'] = df_.date.dt.week
df_['month'] = df_.date.dt.month
df_['year'] = df_.date.dt.year
# let's compute log transform for those 3 volume features
df_['log_daily_volume'] = np.log(df_['daily_volume'])
df_['log_close_volume'] = np.log(df_['close_volume'])
df_['log_close_pct'] = np.log(df_['close_pct'])
df_['log_daily_volume_mean_20d'] = np.log(df_['daily_volume_mean_20d'])
df_['log_close_volume_mean_20d'] = np.log(df_['close_volume_mean_20d'])
df_['log_close_pct_mean_20d'] = np.log(df_['close_pct_mean_20d'])
return df_
# compute average daily volume across all tickers
df_avg = compute_average_daily_volume_all_tickers()
df_avg.head()
# now let's visualize the distribution of these features
f, axes = plt.subplots(2, 3, figsize=(20, 10))
sns.distplot(df_avg[(df_avg.special_day==0)].daily_volume, ax=axes[0, 0])
sns.distplot(df_avg[(df_avg.special_day==0)].close_volume, ax=axes[0, 1])
sns.distplot(df_avg[(df_avg.special_day==0)].close_pct, ax=axes[0, 2])
sns.distplot(df_avg[(df_avg.special_day==1)].daily_volume, ax=axes[1, 0])
sns.distplot(df_avg[(df_avg.special_day==1)].close_volume, ax=axes[1, 1])
sns.distplot(df_avg[(df_avg.special_day==1)].close_pct, ax=axes[1, 2])
From the previous distribution figure, we can see in the first row, volume features for regular days and in the second row, the same for special days.
We can see that all 3 features are skewed, especially close_volume.
To mitigate the impact of this skewness on our models, we will apply a log transform to the volume features.
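To see why the log transform helps, one can compare sample skewness before and after on a synthetic right-skewed series. Here a lognormal draw stands in for a volume series; the parameters are illustrative only:

```python
import numpy as np
from scipy import stats

# lognormal sample standing in for a heavily right-skewed volume series
rng = np.random.default_rng(2)
volumes = rng.lognormal(mean=15, sigma=1.0, size=5000)

# skewness of the raw series vs its log transform
skew_raw = stats.skew(volumes)
skew_log = stats.skew(np.log(volumes))
```

The raw series is strongly right-skewed while its log is close to symmetric, which is the behavior we observe on the actual close_volume distributions below.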
# now let's visualize the distribution of these log transformed features
f, axes = plt.subplots(2, 3, figsize=(20, 10))
sns.distplot(df_avg[(df_avg.special_day==0)].log_daily_volume, ax=axes[0, 0])
sns.distplot(df_avg[(df_avg.special_day==0)].log_close_volume, ax=axes[0, 1])
sns.distplot(df_avg[(df_avg.special_day==0)].log_close_pct, ax=axes[0, 2])
sns.distplot(df_avg[(df_avg.special_day==1)].log_daily_volume, ax=axes[1, 0])
sns.distplot(df_avg[(df_avg.special_day==1)].log_close_volume, ax=axes[1, 1])
sns.distplot(df_avg[(df_avg.special_day==1)].log_close_pct, ax=axes[1, 2])
close_volume is much less skewed now.
# let's look at boxplots now
f, axes = plt.subplots(1, 3, figsize=(20, 5))
sns.boxplot(x='special_day', y='log_daily_volume',data=df_avg, ax=axes[0])
sns.boxplot(x='special_day', y='log_close_volume',data=df_avg, ax=axes[1])
sns.boxplot(x='special_day', y='log_close_pct',data=df_avg, ax=axes[2])
Volumes on special days are clearly significantly higher than on normal days.
# let's look at whether the day of the week has any impact on volumes
# let's first re-establish the weekday string column in our average volume dataframe
df_avg['weekday_str'] = df_avg.date.dt.weekday_name
# then plot a boxplot for our 3 volume features
f, axes = plt.subplots(1, 3, figsize=(20, 5))
sns.boxplot(x='weekday_str', y='log_daily_volume',data=df_avg, ax=axes[0])
sns.boxplot(x='weekday_str', y='log_close_volume',data=df_avg, ax=axes[1])
sns.boxplot(x='weekday_str', y='log_close_pct',data=df_avg, ax=axes[2])
These plots show:
Monday is much quieter than any other days with significantly lower volumes.close_volume and close_pct take a wider range of value on Friday, arguably because quite a few special days fall on Fridays as the following bar chart proves:df[df.special_day==1][['ticker', 'weekday']].groupby(['weekday']).count().rename(columns={'ticker': 'ct'}).reset_index().plot.bar(x='weekday', y='ct')
daily_volume
def chart_one_measure(measure, x_range_init=None):
tools = [CrosshairTool(), ZoomInTool(), ZoomOutTool(), PanTool(), BoxZoomTool()]
p = figure(plot_width=1600, plot_height=550, x_axis_type="datetime", tools=tools, x_range=x_range_init)
p.title.text = measure
data_dict = dict()
tickers = [measure] +['Monthly Last Business Day', 'MSCI Quarterly Review', 'MSCI Semi Annual Review', 'Quarterly Expiry']
conditions = ['', 'date_type_monthly_last_business_day', 'date_type_msci_quarterly_review',
'date_type_msci_semi_annual_review', 'date_type_quarterly_expiry']
colors = ['darkgrey', 'crimson', 'blue', 'darkgreen', 'orange']
for ticker_, cond_, color_ in zip(tickers, conditions, colors):
if cond_ == '':
df_chart = df_avg
else:
df_chart = df_avg.loc[df_avg[cond_]==1]
li = p.line(df_chart['date'], df_chart[measure], line_width=2, color=color_, alpha=0.8, legend=ticker_)
if cond_ == '':
p.add_tools(HoverTool(renderers=[li], tooltips=[('date', '@x{%F}'), (measure, '@y{0,0}')], mode='vline', formatters={'x': 'datetime'}))
else:
p.circle(df_chart['date'], df_chart[measure], fill_color=color_, size=8)
data_dict[ticker_] = ColumnDataSource(ColumnDataSource.from_df(df_chart))
p.legend.location = "top_left"
p.legend.click_policy="hide"
p.yaxis.formatter = NumeralTickFormatter(format="0,0")
y_range_callback = CustomJS(args=dict(plot=p, source=data_dict, tickers=tickers, measure=measure), code="""var Xstart=plot.x_range.start,Xend=plot.x_range.end;function sGE(a){return a>=Xstart}function eGE(a){return a>=Xend}function fixstart(a){return 0<=a?a:x.length-20}function fixend(a){return 0<a?a:x.length-1}for(var data=source[tickers[0]].data,x=data.date,Istart=fixstart(x.findIndex(sGE)),Iend=fixend(x.findIndex(eGE)),ymin=999999999999,ymax=-99999999999,i=0;i<tickers.length;i++){data=source[tickers[i]].data;var y=data[measure],yview=y.slice(Istart,Iend+1);ymin=Math.min(...yview,ymin),ymax=Math.max(...yview,ymax)}var dy=ymax-ymin,margin=.2*dy,last_range=plot.y_range.end-plot.y_range.start,new_range=ymax+margin-(ymin-margin),variation=Math.abs((new_range-last_range)/last_range);.05<variation&&(plot.y_range.start=ymin-margin,plot.y_range.end=ymax+margin);""")
p.y_range.callback = y_range_callback
show(p)
# let's plot average daily volumes with an initial zoom on the period April-June 2014
x_range_init = (pd.to_datetime(df_avg.loc[df_avg.date>='2014-05-01', 'date'].head(1).values[0]), pd.to_datetime(df_avg.loc[df_avg.date>='2014-07-01', 'date'].head(1).values[0]))
display(chart_one_measure('daily_volume', x_range_init))
What can be learnt from this chart:
Quarterly Expiry seems to be the most correlated with a spike in daily_volume.
MSCI Semi Annual Review seems to be the second most correlated with a spike in daily_volume.
Two MSCI Semi Annual Review markers are close to one another:
a. one is on 2014-05-28 and has low volume
b. one is on 2014-05-30 and has high volume
We will remove the MSCI Semi Annual Review flag for 2014-05-28 as it seems to be a glitch in the raw data.
# let's remove MSCI Semi Annual Review flag for 2014-05-28
df.loc[df.date=='2014-05-28', 'date_type_msci_semi_annual_review'] = 0
# let's re-compute average daily volume across all tickers
df_avg = compute_average_daily_volume_all_tickers()
# let's plot again the average daily volumes with an initial zoom on the period April-June 2014
x_range_init = (pd.to_datetime(df_avg.loc[df_avg.date>='2014-05-01', 'date'].head(1).values[0]), pd.to_datetime(df_avg.loc[df_avg.date>='2014-07-01', 'date'].head(1).values[0]))
display(chart_one_measure('daily_volume', x_range_init))
This anomaly is now gone.
# let's plot more volume related measures to get a feel of which might be more correlated to our special days
display(chart_one_measure('daily_volume'))
display(chart_one_measure('close_volume'))
display(chart_one_measure('close_pct'))
display(chart_one_measure('close_volume_mean_20d'))
display(chart_one_measure('close_pct_mean_20d'))
Looking at these charts it seems that predicting close_volume_mean_20d and close_pct_mean_20d would be much easier than close_volume and close_pct which makes sense as they're smoothed and less noisy.
As we will be exploring many linear and non-linear models, we create helper functions for the following tasks: dataset creation (create_dataset), grid search with cross-validation (run_grid_search_cv), model training and evaluation (run_pipeline), result display (show_pipeline_results) and prediction charting (chart_predictions).
def create_dataset(df_, test_size_, features_, features_shifts_, target_col_name_, with_normalization_=True, random_state_=1234, show_log_=False):
"""
:param df_: input dataframe containing both features and targets.
:param test_size_: size of the test dataset as a fraction of the total size of the input dataframe.
:param features_: list of feature column names to use.
:param features_shifts_: list of shifts (in business days) to apply to each feature column.
:param target_col_name_: the name of the column to predict.
:param with_normalization_: if True the features are normalized, otherwise they are left as is.
:param random_state_: seed for the random number generator used by sklearn.
:param show_log_: if True, prints logs to the console.
:return: x_train, x_test, y_train, y_test, the final feature list, x, y and the fitted scaler (or None).
"""
# let's start by taking a copy of the input dataframe
df = df_.copy()
# let's use the 'date' column as the index for our new dataframe df
df = df.set_index('date')
# now let's make sure we're not leaking data into the future by reviewing at which date D our features and targets will be taken from:
# 1. Targets: for each sample at date D, the target we try to predict is given by target_col_name_ at date D
# 2. Special Days features and weekday: all date_type features and weekday at date D can be used for the date D prediction as their values are known well in advance
# 3. Previous target_col_name_ feature and any other 20-day average feature: for each sample at date D, these features are brought one day forward from D-1 to D, as their values
# on date D are only known at the end of day D, meaning they cannot be used to make a prediction on day D
# compute previous target_col_name_ value
prev_target_col_name_ = 'prev_{}'.format(target_col_name_)
df[prev_target_col_name_] = df[target_col_name_].shift(1)
# let's shift feature columns as specified by features_shifts_:
# in the process we rename the new features with the following naming convention:
# *if shift_value > 0: {feature_name}_p{shift_value} (p as in positive)
# *if shift_value < 0: {feature_name}_n{shift_value} (n as in negative)
# *if shift_value == 0: feature_name (as in we don't change the feature name)
new_features_ = list()
for feature, shift_value in zip(features_, features_shifts_):
new_feature_name = feature
if shift_value != 0:
new_feature_name = '{}_{}{}'.format(feature, 'p' if shift_value > 0 else 'n', shift_value)
df[new_feature_name] = df[feature].shift(shift_value)
new_features_.append(new_feature_name)
features_ = new_features_
# let's only keep relevant columns (features + targets)
features_ = [prev_target_col_name_] + features_
targets_ = [target_col_name_]
df = df[features_+targets_].dropna()
if show_log_:
print('features_: {}'.format(features_))
print('targets_: {}'.format(targets_))
print(df.head().to_string())
# create train and test datasets
x = df[features_]
y = df[targets_]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = test_size_, random_state = random_state_, shuffle=False)
# data normalization
scaler = None
if with_normalization_:
# feature normalization
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
return x_train, x_test, y_train, y_test, features_, x, y, scaler
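The leakage-avoidance rule in create_dataset boils down to pandas shift(1): the feature available for day D is the target observed on D-1, so the very first day has no value. A minimal illustration:

```python
import pandas as pd

# a tiny daily target series
s = pd.Series([10.0, 12.0, 11.0, 13.0],
              index=pd.date_range('2019-01-01', periods=4))

# feature for day D is the target observed on D-1; day 1 has no prior value
prev = s.shift(1)
```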
# run grid search with cross validation
def run_grid_search_cv(model, grid_params, x, y, cv_count=30):
gs = GridSearchCV(model,
grid_params,
verbose=1,
cv=cv_count,
n_jobs = -1
)
gs_results = gs.fit(x, y)
print(gs_results.best_score_)
print(gs_results.best_estimator_)
print(gs_results.best_params_)
return gs_results
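One caveat worth flagging: with an integer cv, GridSearchCV uses plain K-fold splits, which for time-ordered data lets validation folds precede some training data. sklearn's TimeSeriesSplit is a drop-in alternative whose expanding-window folds only ever validate on data later than the training window. A sketch on synthetic data (the Ridge model and alpha grid here are illustrative, not the grids used later in this notebook):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# synthetic regression problem with a strong linear signal
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# expanding-window folds that never train on data later than the validation block
gs = GridSearchCV(Ridge(), {'alpha': [0.1, 1.0, 10.0]}, cv=TimeSeriesSplit(n_splits=5))
gs.fit(X, y)
```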
def run_pipeline(df_, models_, test_size_, features_, features_shifts_, target_col_name_, with_normalization_=True, random_state_=1234, show_log_=False):
"""
:param df_: input dataframe containing both features and targets.
:param models_: dict mapping each model id to a list [model, model_params, model_grid_params].
:param test_size_: size of the test dataset as a fraction of the total size of the input dataframe.
:param features_: list of feature column names to use.
:param features_shifts_: list of shifts (in business days) to apply to each feature column.
:param target_col_name_: the name of the column to predict.
:param with_normalization_: if True the features are normalized, otherwise they are left as is.
:param random_state_: seed for the random number generator used by sklearn.
:param show_log_: if True, prints logs to the console.
:return: a KPI dataframe (one row per model) and a dict describing the dataset.
"""
# create train and test datasets
x_train, x_test, y_train, y_test, features_final, x, y, scaler = create_dataset(df_, test_size_, features_, features_shifts_, target_col_name_, with_normalization_=with_normalization_, random_state_=random_state_, show_log_=show_log_)
# train and evaluate each model
kpi = dict()
for model_id, [model, model_params, model_grid_params] in models_.items():
print('Start processing: {} of class {} with params {}'.format(model_id, model.__class__.__name__, model_params))
start_time = time.time()
is_model_trained = False
# run grid search if required
if model_grid_params:
print('Running GridSearchCV for model: {}'.format(model_id))
gs_results = run_grid_search_cv(model, model_grid_params, x_train, y_train, cv_count=5)
model_params = gs_results.best_params_
models_[model_id][0] = gs_results.best_estimator_
model = models_[model_id][0]
is_model_trained = True
if not is_model_trained:
# setup model's parameters
if model_params:
if model_id == 'LightGBM_Regressor':
model.set_params(model_params)
else:
model.set_params(**model_params)
# model training
if model_id == 'LightGBM_Regressor':
model.fit(x_train, y_train, x_test, y_test)
else:
model.fit(x_train, y_train)
# model predictions with both train and test datasets
y_train_preds = model.predict(x_train)
y_test_preds = model.predict(x_test)
# model evaluation using MSE and R2
train_mse = mean_squared_error(y_train, y_train_preds)
train_mae = mean_absolute_error(y_train, y_train_preds)
train_r2 = r2_score(y_train, y_train_preds)
test_mse = mean_squared_error(y_test, y_test_preds)
test_mae = mean_absolute_error(y_test, y_test_preds)
test_r2 = r2_score(y_test, y_test_preds)
# model generalization score (R2) using cross validation
if model_id == 'LightGBM_Regressor':
scores_mean = 0
scores_std = 0
else:
scores = cross_val_score(model, x_train, y_train, cv=5)
scores_mean = scores.mean()
scores_std = scores.std()
end_time = time.time()
kpi[model_id] = {'train_mse': train_mse,
'train_r2': train_r2,
'train_mae': train_mae,
'test_mse': test_mse,
'test_r2': test_r2,
'test_mae': test_mae,
'train_acc': scores_mean,
'train_std': scores_std,
'processing_time': end_time - start_time
}
kpi_df = pd.DataFrame([v for k, v in kpi.items()], index=list(kpi.keys()))
dataset = {'features': features_final, 'target': target_col_name_,
'x': x, 'y': y, 'x_train': x_train, 'x_test': x_test, 'y_train': y_train, 'y_test': y_test,
'scaler': scaler}
return kpi_df, dataset
def show_pipeline_results(features_final, kpis, kpis_sort_field='test_r2'):
pd.options.display.max_columns = None
pd.options.display.max_colwidth = 400
pd.set_option('display.float_format', lambda x: '%.3f' % x)
print('features_final: {}'.format(features_final))
display(kpis.sort_values([kpis_sort_field], ascending=False))
def chart_predictions(x, y, x_train, y_train, x_test, y_test, models, model_names, x_range_init=None):
"""
:param x: the original feature matrix (dataframe).
:param y: the original target vector (dataframe).
:param x_train: the normalized train feature vectors (ndarray).
:param y_train: the normalized train target vector (ndarray).
:param x_test: the normalized test feature vectors (ndarray).
:param y_test: the normalized test target vector (ndarray).
:param models: dict of all models.
:param model_names: list of all model names we wish to plot predictions for.
:param x_range_init: initial x-axis range (start, end) for the plot, or None.
:return: None.
"""
tools = [CrosshairTool(), ZoomInTool(), ZoomOutTool(), PanTool(), BoxZoomTool()]
p = figure(plot_width=1600, plot_height=550, x_axis_type="datetime", tools=tools, x_range=x_range_init)
measure = y.columns[0]
p.title.text = measure
data_dict = dict()
model_names = ['actual'] + model_names
# getting some colors
colors = ['black'] + Category20[20]
# plotting a vertical black ray to separate train and test data
p.ray(x = y.index[x_train.shape[0]-1], y = 0, length = 0, angle_units = "deg",
angle = 90, color = 'black', line_width=2)
for model_name_, color_ in zip(model_names, colors):
if model_name_ == 'actual':
# case to handle the true target data held in variable y
df_chart = y
line_alpha = 1
elif 'arima' in model_name_.lower():
# case to handle arima based models
line_alpha = 0.5
model = models[model_name_][0]
# for arima based model, predictions can be obtained from the model's object directly
df_chart = model.df_preds
else:
# case to handle any other models
line_alpha = 0.5
model = models[model_name_][0]
df_chart = y.copy()
df_chart[measure] = np.hstack([model.predict(x_train).ravel(), model.predict(x_test).ravel()])
li = p.line(df_chart.index, df_chart[measure], line_width=2, color=color_, alpha=0.8, legend=model_name_, line_alpha=line_alpha)
p.add_tools(HoverTool(renderers=[li], tooltips=[('date', '@x{%F}'), (model_name_, '@y{0,0}')], mode='vline', formatters={'x': 'datetime'}))
data_dict[model_name_] = ColumnDataSource(ColumnDataSource.from_df(df_chart))
# draw circle for special days
index_ = x.loc[x.special_day==1].index
values_ = y.loc[index_][measure]
p.circle(index_, values_, fill_color='black', size=8)
p.legend.location = "top_left"
p.legend.click_policy="hide"
p.yaxis.formatter = NumeralTickFormatter(format="0,0")
y_range_callback = CustomJS(args=dict(plot=p, source=data_dict, tickers=model_names, measure=measure), code="""var Xstart=plot.x_range.start,Xend=plot.x_range.end;function sGE(a){return a>=Xstart}function eGE(a){return a>=Xend}function fixstart(a){return 0<=a?a:x.length-20}function fixend(a){return 0<a?a:x.length-1}for(var data=source[tickers[0]].data,x=data.date,Istart=fixstart(x.findIndex(sGE)),Iend=fixend(x.findIndex(eGE)),ymin=999999999999,ymax=-99999999999,i=0;i<tickers.length;i++){data=source[tickers[i]].data;var y=data[measure],yview=y.slice(Istart,Iend+1);ymin=Math.min(...yview,ymin),ymax=Math.max(...yview,ymax)}var dy=ymax-ymin,margin=.2*dy,last_range=plot.y_range.end-plot.y_range.start,new_range=ymax+margin-(ymin-margin),variation=Math.abs((new_range-last_range)/last_range);.05<variation&&(plot.y_range.start=ymin-margin,plot.y_range.end=ymax+margin);""")
p.y_range.callback = y_range_callback
show(p)
# let's create a class to support the sklearn API (fit, predict) for lightgbm
class LightGBM_Regressor:
def __init__(self):
self.params = None
self.model = None
def set_params(self, params):
self.params = params
def fit(self, x_train, y_train, x_test, y_test):
d_train = lgb.Dataset(x_train, label=y_train)
d_valid = lgb.Dataset(x_test, label=y_test)
self.model = lgb.train(self.params, d_train, valid_sets=d_valid, verbose_eval=0, num_boost_round=200)
def predict(self, x_test):
if self.model is None:
return np.array([])
return self.model.predict(x_test)
def models_init():
"""
Defines the models that will be considered during training and testing.
Having a function to do this task allows us to re-initialize all considered models between experiments.
"""
# below 'models' variable is a dictionary which defines the models, their parameters as well as grid search parameters if required.
# its structure is as follows:
# 'key': [model_instance(), {model_params}, {grid_search_params}]
models = {'LinearRegression': [LinearRegression(), {}, {}],
'Ridge': [Ridge(), {}, {}],
'Ridge GS': [Ridge(alpha=10.0, copy_X=True, fit_intercept=True, max_iter=None, normalize=False, random_state=None, solver='saga', tol=0.001), {}, {}],
'Lasso': [Lasso(), {}, {}],
'Lasso GS': [Lasso(), {}, {'alpha': [1.0, 5.0, 10.0, 0.1, 0.01], 'tol': [0.0001, 0.001, 0.01]}],
'ElasticNet': [ElasticNet(), {}, {}],
'ElasticNet GS': [ElasticNet(), {}, {'alpha': [1.0, 5.0, 10.0, 0.1, 0.01], 'l1_ratio': [1, 0, 0.5]}],
'Lars': [Lars(), {}, {}],
'Lars GS': [Lars(), {}, {'n_nonzero_coefs': [500, 400]}],
'OrthogonalMatchingPursuit': [OrthogonalMatchingPursuit(), {}, {}],
'BayesianRidge': [BayesianRidge(), {}, {}],
'BayesianRidge GS': [BayesianRidge(), {}, {'n_iter': [300, 200, 400], 'tol': [0.001, 0.01, 0.1, 0.0001], 'alpha_1': [1e-06, 1e-05, 1e-04], 'alpha_2': [1e-06, 1e-05, 1e-04], 'lambda_1': [1e-06, 1e-05, 1e-04], 'lambda_2': [1e-06, 1e-05, 1e-04]}],
'SGDRegressor': [SGDRegressor(), {}, {}],
'SGDRegressor GS': [SGDRegressor(), {'alpha': 0.0001, 'l1_ratio': 1, 'learning_rate': 'optimal', 'loss': 'squared_loss', 'max_iter': 2000, 'penalty': 'none'}, {}],
'PassiveAggressiveRegressor': [PassiveAggressiveRegressor(), {}, {}],
'PassiveAggressiveRegressor GS': [PassiveAggressiveRegressor(), {}, {'C': [1, 10, 100, 0.1], 'max_iter': [1000, 500, 1500, 2000], 'tol': [1e-04, 1e-03, 1e-02]}],
'HuberRegressor': [HuberRegressor(), {}, {}],
'HuberRegressor GS': [HuberRegressor(), {}, {'epsilon': [1, 1.35, 2.0, 2.5], 'max_iter': [100, 300, 500], 'alpha': [1e-02, 1e-03, 1e-04, 1e-05], 'tol': [1e-05, 1e-04, 1e-03]}],
'DecisionTreeRegressor': [DecisionTreeRegressor(), {}, {}],
'DecisionTreeRegressor GS': [DecisionTreeRegressor(), {}, {'criterion': ['mse', 'mae'], 'max_depth': [None, 5, 10, 50], 'min_samples_split': [2, 4, 8, 16], 'max_features': [None, 'auto', 'sqrt', 'log2']}],
'KNeighborsRegressor': [KNeighborsRegressor(), {}, {}],
'KNeighborsRegressor GS': [KNeighborsRegressor(), {}, {'n_neighbors': [3, 5, 7, 10, 15], 'weights': ['uniform', 'distance'], 'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
'leaf_size': [10, 30, 50], 'p': [1, 1.5, 2]}],
'SVR': [SVR(), {'gamma': 'scale'}, {}],
#'SVR GS': [SVR(), {}, {'kernel': ['linear', 'poly'], 'degree': [2, 3, 5], 'gamma': ['scale', 'auto'], 'coef0': [0, 0.5],
# 'tol': [1e-03, 1e-2], 'C': [1, 10], 'epsilon': [1e-1, 1e-2]}],
#'SVR rs GS': [SVR(), {}, {'kernel': ['rbf', 'sigmoid'], 'degree': [2, 3, 5], 'gamma': ['scale', 'auto'], 'coef0': [0, 0.5],
# 'tol': [1e-03, 1e-2], 'C': [1, 10], 'epsilon': [1e-1, 1e-2]}],
#'NuSVR': [NuSVR(), {'gamma': 'scale'}, {}],
#'NuSVR GS': [NuSVR(), {}, {'nu': [0.2, 0.5, 0.7], 'C': [1, 10, 100], 'kernel': ['linear', 'poly', 'rbf', 'sigmoid'], 'degree': [3, 5],
# 'gamma': ['scale', 'auto'], 'tol': [1e-02, 1e-03],}],
#'LinearSVR': [LinearSVR(), {}, {}],
#'LinearSVR GS': [LinearSVR(), {}, {'epsilon': [0], 'tol': [1e-02, 1e-03], 'C': [1, 10, 100],
# 'loss': ['epsilon_insensitive', 'squared_epsilon_insensitive'], 'dual': [True], 'max_iter': [1000, 500, 2000]}],
'KernelRidge': [KernelRidge(), {}, {}],
'KernelRidge GS': [KernelRidge(), {}, {'alpha': [1, 2, 10], 'kernel': ['linear'], 'degree': [2, 3, 5]}],
'LightGBM_Regressor': [LightGBM_Regressor(), {'boosting_type': 'gbdt',
'objective': 'regression',
'metric': {'rmse'},
'learning_rate': 0.5,
'num_leaves': 150,
'verbose': 0}, {}]
}
return models
close_volume
# establishing a baseline model
models = models_init()
test_size = 0.3
target_col_name = 'close_volume'
features = ['special_day', 'weekday', 'date_type_monthly_last_business_day', 'date_type_msci_quarterly_review', 'date_type_msci_semi_annual_review', 'date_type_quarterly_expiry']
features_shifts = [0, 0, 0, 0, 0, 0]
with_normalization = True
kpis, dataset = run_pipeline(df_avg, models, test_size, features, features_shifts, target_col_name, with_normalization_=with_normalization)
show_pipeline_results(dataset['features'], kpis)
# let's plot in & out of sample predictions for the top 5 models
# note: we will zoom on the first 40 out-sample predictions, you can pan right or left to visualize more out-sample predictions or in-sample ones.
# note: special day will be marked with a black circle.
# note: the separation between in & out of sample data is marked with a vertical black line.
top_5_model_names = list(kpis.sort_values('test_r2', ascending=False).index[:5])
#top_5_model_names = ['ElasticNet', 'ElasticNet GS']
test_start_date = dataset['x'].index[dataset['x_train'].shape[0]]
init_end_date = dataset['x'].index[dataset['x_train'].shape[0]+40]
x_range_init = (pd.to_datetime(test_start_date), pd.to_datetime(init_end_date))
chart_predictions(dataset['x'], dataset['y'], dataset['x_train'], dataset['y_train'], dataset['x_test'], dataset['y_test'], models, top_5_model_names, x_range_init=x_range_init)
We can see that the baseline models already show a decent ability to predict spikes in close volume for special days.
Outside of special days the models struggle a little more.
# using more available features to improve our baseline model
# we're adding 'daily_volume', 'daily_volume_mean_20d', 'close_volume_mean_20d' and 'close_pct_mean_20d'
# we're making sure to shift forward those additional 4 features by 1 period to avoid forward-looking bias.
models = models_init()
test_size = 0.3
target_col_name = 'close_volume'
features = ['special_day', 'weekday', 'daily_volume', 'daily_volume_mean_20d', 'close_volume_mean_20d', 'close_pct_mean_20d', 'date_type_monthly_last_business_day', 'date_type_msci_quarterly_review', 'date_type_msci_semi_annual_review', 'date_type_quarterly_expiry']
features_shifts = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
with_normalization = True
kpis, dataset = run_pipeline(df_avg, models, test_size, features, features_shifts, target_col_name, with_normalization_=with_normalization, show_log_=False)
show_pipeline_results(dataset['features'], kpis)
We can see a clear improvement in out of sample R^2 and MAE.
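As a sanity check on the shifting logic, here is a minimal, self-contained illustration (hypothetical values, pandas only) of how a 1-period forward shift prevents forward-looking bias:

```python
import pandas as pd

# hypothetical daily volumes
df = pd.DataFrame({'daily_volume': [100, 120, 90, 150]},
                  index=pd.date_range('2019-01-01', periods=4, freq='D'))
# shift forward by 1 period: the feature for day t is the volume observed on day t-1,
# which is the latest value actually known when predicting day t's close volume
df['daily_volume_shift1'] = df['daily_volume'].shift(1)
# on 2019-01-02 the model sees 100 (yesterday's volume), not the contemporaneous 120
```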
# let's plot in & out of sample predictions for the top 5 models
# note: we will zoom on the first 40 out-sample predictions, you can pan right or left to visualize more out-sample predictions or in-sample ones.
# note: special day will be marked with a black circle.
# note: the separation between in & out of sample data is marked with a vertical black line.
top_5_model_names = list(kpis.sort_values('test_r2', ascending=False).index[:5])
test_start_date = dataset['x'].index[dataset['x_train'].shape[0]]
init_end_date = dataset['x'].index[dataset['x_train'].shape[0]+40]
x_range_init = (pd.to_datetime(test_start_date), pd.to_datetime(init_end_date))
chart_predictions(dataset['x'], dataset['y'], dataset['x_train'], dataset['y_train'], dataset['x_test'], dataset['y_test'], models, top_5_model_names, x_range_init=x_range_init)
In this experiment, we will seek to improve our model further by adding new features derived from existing ones using rolling measures and lags.
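Before defining the utility functions, here is the core pandas pattern they build on, as a minimal sketch on a synthetic series: a rolling mean followed by a 1-period shift, so the derived feature for day t only uses data available up to day t-1.

```python
import numpy as np
import pandas as pd

# synthetic close_volume values
s = pd.Series(np.arange(10.0, 20.0), name='close_volume')
# 5-day rolling mean of past values: rolling(5).mean() then shift(1)
feat = s.rolling(5).mean().shift(1)
# 4 NaNs from the rolling warm-up plus 1 from the shift;
# feat.iloc[5] is the mean of the first five values: (10+11+12+13+14)/5 = 12.0
```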
# first let's define a couple of utility functions that will help us enrich the existing dataframe from which our train and test datasets are created.
def compute_derived_feature(df, col_names, functions, suffix='', samples_query=None, add_to_source_df=False):
if samples_query is not None:
df_ = df.query(samples_query)[col_names].copy()
else:
df_ = df[col_names].copy()
for function_type, function_param in functions:
if function_type == 'rolling_mean':
df_ = df_.rolling(function_param).mean()
suffix = '{}_rmean{}'.format(suffix, function_param)
elif function_type == 'rolling_min':
df_ = df_.rolling(function_param).min()
suffix = '{}_rmin{}'.format(suffix, function_param)
elif function_type == 'rolling_max':
df_ = df_.rolling(function_param).max()
suffix = '{}_rmax{}'.format(suffix, function_param)
elif function_type == 'rolling_std':
df_ = df_.rolling(function_param).std()
suffix = '{}_rstd{}'.format(suffix, function_param)
elif function_type == 'shift' and function_param != 0:
df_ = df_.shift(function_param)
suffix = '{}_{}{}'.format(suffix, 'p' if function_param > 0 else 'n', function_param)
col_names = df_.columns
new_col_names = ['{}_{}'.format(c, suffix) for c in col_names]
df_ = df_.rename(columns=dict(zip(col_names, new_col_names)))
#df_ = df_.dropna()
if add_to_source_df:
df = pd.concat([df, df_], axis=1, sort=False).ffill() #.dropna()
return df
return df_
def create_advanced_df(df):
df_ = df.copy()
# let's add a 5-day rolling mean for special day volumes, e.g. rolling.mean(5).shift(1) for 'daily_volume', 'close_volume', 'close_pct' for samples with special_day==1
df_ = compute_derived_feature(df_, ['daily_volume', 'close_volume', 'close_pct'], [('rolling_mean', 5), ('shift', 1)], suffix='sd', samples_query='special_day==1', add_to_source_df=True)
# let's add a 2-day rolling mean for special day volumes, e.g. rolling.mean(2).shift(1) for 'daily_volume', 'close_volume', 'close_pct' for samples with special_day==1
df_ = compute_derived_feature(df_, ['daily_volume', 'close_volume', 'close_pct'], [('rolling_mean', 2), ('shift', 1)], suffix='sd', samples_query='special_day==1', add_to_source_df=True)
# let's add per-event-type 5-day rolling means, e.g. rolling.mean(5).shift(1) for 'daily_volume', 'close_volume', 'close_pct', computed separately for each type of special day
df_ = compute_derived_feature(df_, ['daily_volume', 'close_volume', 'close_pct'], [('rolling_mean', 5), ('shift', 1)], suffix='sd_lbd', samples_query='(special_day==1) & (date_type_monthly_last_business_day==1)', add_to_source_df=True)
df_ = compute_derived_feature(df_, ['daily_volume', 'close_volume', 'close_pct'], [('rolling_mean', 5), ('shift', 1)], suffix='sd_qr', samples_query='(special_day==1) & (date_type_msci_quarterly_review==1)', add_to_source_df=True)
df_ = compute_derived_feature(df_, ['daily_volume', 'close_volume', 'close_pct'], [('rolling_mean', 5), ('shift', 1)], suffix='sd_sar', samples_query='(special_day==1) & (date_type_msci_semi_annual_review==1)', add_to_source_df=True)
df_ = compute_derived_feature(df_, ['daily_volume', 'close_volume', 'close_pct'], [('rolling_mean', 5), ('shift', 1)], suffix='sd_qe', samples_query='(special_day==1) & (date_type_quarterly_expiry==1)', add_to_source_df=True)
# let's add a 1-day lag for special day volumes, e.g. shift(1) for 'daily_volume', 'close_volume', 'close_pct' for samples with special_day==1
df_ = compute_derived_feature(df_, ['daily_volume', 'close_volume', 'close_pct'], [('shift', 1)], suffix='sd', samples_query='special_day==1', add_to_source_df=True)
# let's add a 2-day lag for special day volumes, e.g. shift(2) for 'daily_volume', 'close_volume', 'close_pct' for samples with special_day==1
df_ = compute_derived_feature(df_, ['daily_volume', 'close_volume', 'close_pct'], [('shift', 2)], suffix='sd', samples_query='special_day==1', add_to_source_df=True)
# let's add a 3-day lag for special day volumes, e.g. shift(3) for 'daily_volume', 'close_volume', 'close_pct' for samples with special_day==1
df_ = compute_derived_feature(df_, ['daily_volume', 'close_volume', 'close_pct'], [('shift', 3)], suffix='sd', samples_query='special_day==1', add_to_source_df=True)
return df_.dropna()
# from df_avg let's create our advanced dataframe
df_avg_adv = create_advanced_df(df_avg)
df_avg_adv.head()
# using more available features to improve our baseline model
models = models_init()
test_size = 0.3
target_col_name = 'close_volume'
features = ['special_day', 'weekday', 'day', 'week', 'month', 'daily_volume', 'daily_volume_mean_20d', 'close_volume_mean_20d', 'close_pct_mean_20d', 'date_type_monthly_last_business_day',
'date_type_msci_quarterly_review', 'date_type_msci_semi_annual_review', 'date_type_quarterly_expiry',
'daily_volume_sd_rmean5_p1', 'close_volume_sd_rmean5_p1', 'close_pct_sd_rmean5_p1', 'daily_volume_sd_p1', 'close_volume_sd_p1',
'close_pct_sd_p1', 'daily_volume_sd_p2', 'close_volume_sd_p2', 'close_pct_sd_p2', 'daily_volume_sd_p3', 'close_volume_sd_p3', 'close_pct_sd_p3',
'close_volume_sd_lbd_rmean5_p1', 'close_volume_sd_qr_rmean5_p1', 'close_volume_sd_sar_rmean5_p1', 'close_volume_sd_qe_rmean5_p1'
]
features_shifts = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
with_normalization = True
kpis, dataset = run_pipeline(df_avg_adv, models, test_size, features, features_shifts, target_col_name, with_normalization_=with_normalization, show_log_=False)
show_pipeline_results(dataset['features'], kpis)
We can see that adding those extra features did improve out of sample R^2 and MAE significantly.
# let's plot in & out of sample predictions for the top 5 models
# note: we will zoom on the first 40 out-sample predictions, you can pan right or left to visualize more out-sample predictions or in-sample ones.
# note: special day will be marked with a black circle.
# note: the separation between in & out of sample data is marked with a vertical black line.
top_5_model_names = list(kpis.sort_values('test_r2', ascending=False).index[:5])
test_start_date = dataset['x'].index[dataset['x_train'].shape[0]]
init_end_date = dataset['x'].index[dataset['x_train'].shape[0]+40]
x_range_init = (pd.to_datetime(test_start_date), pd.to_datetime(init_end_date))
chart_predictions(dataset['x'], dataset['y'], dataset['x_train'], dataset['y_train'], dataset['x_test'], dataset['y_test'], models, top_5_model_names, x_range_init=x_range_init)
In this experiment we will use ARIMA to predict close_volume.
# we choose a multiplicative model because, intuitively, as daily volumes increase, close volumes would too, and so would their fluctuations.
# we choose a frequency of 63 periods (quarterly) as it represents a good average periodicity for the special days we're trying to model.
close_volume_ts = df_avg.set_index('date')['close_volume']
decomposition = seasonal_decompose(close_volume_ts, model='multiplicative', freq=63)
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid
plt.figure(figsize=(16,10))
plt.figure(1)
plt.subplot(411)
plt.plot(trend*seasonal*residual, label = 'Original')
plt.plot(trend, label = 'Trend',linestyle='--')
plt.legend(loc = 'best')
plt.subplot(412)
plt.plot(seasonal, label = 'Seasonality')
plt.legend(loc = 'best')
plt.subplot(413)
plt.plot(residual, label = 'Residuals')
plt.legend(loc = 'best')
plt.tight_layout()
We can see from this decomposition analysis that close_volume is made up of a slight general up trend, several seasonal components (roughly every quarter) and a non-systematic residual component.
Here we try to ascertain how much the data is correlated with itself.
In other words we're searching for 3 parameters:
p: the number of lag observations
d: the degree of differencing
q: the size of the moving average window
# searching for d by plotting the autocorrelation function
plot_acf(close_volume_ts)
plt.xlim(0,500)
plt.show()
close_volume is non-stationary. We will apply first order differencing (d == 1) to make it stationary.
decomposition = seasonal_decompose(close_volume_ts.diff(1).dropna(), model='additive', freq=63)
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid
plt.figure(figsize=(16,10))
plt.figure(1)
plt.subplot(411)
plt.plot(trend*seasonal*residual, label = 'Original')
plt.plot(trend, label = 'Trend',linestyle='--')
plt.legend(loc = 'best')
plt.subplot(412)
plt.plot(seasonal, label = 'Seasonality')
plt.legend(loc = 'best')
plt.subplot(413)
plt.plot(residual, label = 'Residuals')
plt.legend(loc = 'best')
plt.tight_layout()
The above plots confirm that the time series reaches stationarity with one order of differencing.
Let's compare autocorrelation and partial autocorrelation plots.
def plot_autocorrelation(data, corr_lag=50):
plt.rcParams.update({'figure.figsize':(16,7), 'figure.dpi':120})
fig, axes = plt.subplots(3, 3, sharex=False)
axes[0, 0].plot(data.values); axes[0, 0].set_title('Original Series')
plot_acf(data, lags=corr_lag, ax=axes[0, 1])
plot_pacf(data, lags=corr_lag, ax=axes[0, 2])
# 1st Differencing
axes[1, 0].plot(data.diff(1).dropna().values[:50]); axes[1, 0].set_title('1st Order Differencing')
plot_acf(data.diff(1).dropna(), lags=corr_lag, ax=axes[1, 1])
plot_pacf(data.diff(1).dropna(), lags=corr_lag, ax=axes[1, 2])
# 2nd Differencing
axes[2, 0].plot(data.diff(2).dropna().values[:50]); axes[2, 0].set_title('2nd Order Differencing')
plot_acf(data.diff(2).dropna(), lags=corr_lag, ax=axes[2, 1])
plot_pacf(data.diff(2).dropna(), lags=corr_lag, ax=axes[2, 2])
plt.show()
def plot_normal_probability(data):
plt.rcParams.update({'figure.figsize':(7, 3), 'figure.dpi':120})
plt.subplot(121)
data.hist(bins=50)
plt.subplot(122)
stats.probplot(data, plot=plt)
plot_normal_probability(close_volume_ts)
Our data is clearly not normally distributed.
Let's apply a log transform and first order differencing to bring it closer to normality and make it stationary at the same time.
plot_normal_probability(np.log(close_volume_ts).diff(1).dropna())
This is better but far from being a normal distribution!
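The visual impression can be quantified with D'Agostino's K² test (scipy.stats.normaltest; scipy is already used above for probplot). A sketch on synthetic lognormal data, mirroring the log transform applied to close_volume:

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # heavily right-skewed
_, p_raw = stats.normaltest(sample)            # tiny p-value: clearly non-normal
_, p_log = stats.normaltest(np.log(sample))    # log transform recovers normality here
```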
# let's plot the correlation of our time series observations, with observations from previous time steps (or lags).
# let's plot the same for the first and second order differenced time series as well.
plot_autocorrelation(close_volume_ts, corr_lag=20)
# let's plot the correlation of our log and 1-order differencing transformed time series observations, with observations from previous time steps (or lags).
# let's plot the same for the first and second order differenced time series as well.
plot_autocorrelation(np.log(close_volume_ts).diff(1).dropna(), corr_lag=20)
From the above we can see that our transformed data is stationary.
We could use autocorrelation and partial autocorrelation plots to determine q and p, but we will grid search them instead using the Akaike Information Criterion (AIC).
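For reference, AIC = 2k - 2·ln(L̂), where k is the number of estimated parameters and L̂ is the maximized likelihood; lower is better, so extra parameters must buy enough likelihood to pay for themselves. A toy comparison with hypothetical log-likelihoods:

```python
def aic(k, log_likelihood):
    # Akaike Information Criterion: 2k - 2*ln(L); lower is better
    return 2 * k - 2 * log_likelihood

# hypothetical: a 3-parameter model with a slightly worse fit still beats
# a 6-parameter model whose added complexity doesn't improve the likelihood enough
aic_small = aic(3, -100.0)  # 206.0
aic_large = aic(6, -99.0)   # 210.0
```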
# let's create a class to support the sklearn API (fit, predict) so we can integrate this model with the aforementioned utility functions.
class Sarimax:
def __init__(self, data, true_data=None, is_log_data=False, is_diff_data=False, data_diff_order=0, params=None):
self.data = data
self.true_data = true_data
self.is_log_data = is_log_data
self.is_diff_data = is_diff_data
self.data_diff_order = data_diff_order
self.train = None
self.test = None
self.params = params
self.model = None
self.order = None
self.seasonal_order = None
self.seasonal_periodicity = 1
self.test_size = 0.3
self.end_train_idx = None
self.start_time = None
self.end_time = None
self.kpi = None
self.df_preds = None
self.set_params(self.params)
self.split_train_test_data()
def set_params(self, params):
self.params = params
if params is not None:
if 'order' in self.params:
self.order = params['order']
if 'seasonal_order' in self.params:
self.seasonal_order = params['seasonal_order']
if 'seasonal_periodicity' in self.params:
self.seasonal_periodicity = params['seasonal_periodicity']
if 'test_size' in self.params:
self.test_size = params['test_size']
def grid_search_arima_params(self, data, seasonal_periodicity=1):
Qs = range(0, 1)
qs = range(0, 2)
Ps = range(0, 2)
ps = range(0, 2)
D=1
parameters = product(ps, qs, Ps, Qs)
parameters_list = list(parameters)
best_aic = float("inf")
best_arima = None
for param in parameters_list:
try:
arima = SARIMAX(data.values, order=(param[0], D, param[1]), seasonal_order=(param[2], D, param[3], seasonal_periodicity)).fit(disp=-1)
except:
continue
aic = arima.aic
if best_arima is None:
best_arima = arima
if aic < best_aic:
best_arima = arima
best_aic = aic
print('best_aic: {}'.format(best_aic))
return best_arima, best_aic
def split_train_test_data(self):
# split data into train and test datasets
self.end_train_idx = int(self.data.shape[0]*(1-self.test_size))
self.train = self.data[:self.end_train_idx]
self.test = self.data[self.end_train_idx:]
print('train.shape: {}, test.shape: {}, end_train_idx: {}'.format(self.train.shape, self.test.shape, self.end_train_idx))
def fit(self):
self.start_time = time.time()
if self.order is not None and self.seasonal_order is not None:
best_order = self.order
best_seasonal_order = self.seasonal_order
else:
# grid search the best order and seasonal order
print('grid searching best order and seasonal order')
best_arima, best_aic = self.grid_search_arima_params(self.train, seasonal_periodicity=self.seasonal_periodicity)
best_order = best_arima.specification['order']
best_seasonal_order = best_arima.specification['seasonal_order']
print('best_order: {}'.format(best_order))
print('best_seasonal_order: {}'.format(best_seasonal_order))
# build and train the model
model = SARIMAX(self.train, order=best_order, seasonal_order=best_seasonal_order, enforce_stationarity=False, enforce_invertibility=False)
self.model = model.fit()
def predict_evaluate(self):
if self.model is None:
return np.array([])
# create dataframe to keep train and test predictions
measure = self.data.name
self.df_preds = pd.DataFrame(self.data).rename(columns={self.data.name: 'true_{}'.format(self.data.name)})
# make predictions
y_train_preds = self.model.predict(0, self.end_train_idx - 1)
y_test_preds = self.model.predict(self.end_train_idx, self.end_train_idx + len(self.test) - 1)
self.df_preds[measure] = np.hstack([y_train_preds.ravel(), y_test_preds.ravel()])
#print('df_preds: {}'.format(self.df_preds.head()))
y_train = self.data[:self.end_train_idx]
y_test = self.data[self.end_train_idx:]
# if required, invert the log scaling of the predictions
if self.is_log_data and not self.is_diff_data:
y_train_preds = np.exp(y_train_preds)
y_test_preds = np.exp(y_test_preds)
self.df_preds.drop(columns=[measure], inplace=True)
self.df_preds[measure] = np.hstack([y_train_preds.ravel(), y_test_preds.ravel()])
y_train = np.exp(y_train)
y_test = np.exp(y_test)
# if required, invert the differencing of order self.data_diff_order on the predictions
if self.is_diff_data and self.data_diff_order>0:
if self.is_log_data:
y_train_preds = np.exp(y_train_preds)
y_test_preds = np.exp(y_test_preds)
print('before mult, y_train_preds: {}, y_test_preds: {}'.format(y_train_preds.shape, y_test_preds.shape))
print('mult true: {}'.format(self.true_data[:y_train_preds.shape[0]].shape))
y_train_preds = y_train_preds * self.true_data[self.data_diff_order:y_train_preds.shape[0]+1]
print('after mult, y_train_preds: {}'.format(y_train_preds.shape))
y_test_preds = y_test_preds * self.true_data[self.data_diff_order+y_train_preds.shape[0]: self.data_diff_order+y_train_preds.shape[0]+y_test_preds.shape[0]].values
self.df_preds.drop(columns=[measure], inplace=True)
# we remove the first self.data_diff_order lines as by differencing we've reduced our samples by self.data_diff_order rows!
#self.df_preds = self.df_preds[self.data_diff_order:]
print('Sarimax.predict, df_preds.shape: {}, y_train_preds.shape: {}, y_test_preds.shape: {}, y_train.shape: {}, y_test.shape: {}'.format(self.df_preds.shape, y_train_preds.shape, y_test_preds.shape, y_train.shape, y_test.shape))
self.df_preds[measure] = np.hstack([y_train_preds.ravel(), y_test_preds.ravel()])
y_train = self.true_data[self.data_diff_order:self.end_train_idx + self.data_diff_order]
y_test = self.true_data[self.end_train_idx + self.data_diff_order:]
print('Sarimax.predict, df_preds.shape: {}, y_train_preds.shape: {}, y_test_preds.shape: {}, y_train.shape: {}, y_test.shape: {}'.format(self.df_preds.shape, y_train_preds.shape, y_test_preds.shape, y_train.shape, y_test.shape))
# model evaluation using MSE, MAE and R2
train_mse = mean_squared_error(y_train, y_train_preds)
train_mae = mean_absolute_error(y_train, y_train_preds)
train_r2 = r2_score(y_train, y_train_preds)
test_mse = mean_squared_error(y_test, y_test_preds)
test_mae = mean_absolute_error(y_test, y_test_preds)
test_r2 = r2_score(y_test, y_test_preds)
self.end_time = time.time()
self.kpi = {'train_mse': train_mse,
'train_r2': train_r2,
'train_mae': train_mae,
'test_mse': test_mse,
'test_r2': test_r2,
'test_mae': test_mae,
'train_acc': 0,
'train_std': 0,
'processing_time': self.end_time - self.start_time
}
print('kpi: {}'.format(self.kpi))
return self.kpi
def add_model(model_id, model, model_kpi, global_kpi, global_models):
# add our model's kpi to global_kpi dataframe
global_kpi.loc[model_id] = list(v for k,v in model_kpi.items())
# add our model to global_models dictionary
global_models[model_id] = [model, {}, {}]
params = {'seasonal_periodicity': 21}
data = df_avg.set_index('date')['close_volume']
sarimax = Sarimax(data, params=params)
sarimax.fit()
sarimax.predict_evaluate()
add_model('Sarimax baseline', sarimax, sarimax.kpi, kpis, models)
params = {'seasonal_periodicity': 21}
data = df_avg.set_index('date')['close_volume']
log_data = np.log(data)
sarimax_log = Sarimax(log_data, true_data=data, is_log_data=True, params=params)
sarimax_log.fit()
sarimax_log.predict_evaluate()
add_model('Sarimax log', sarimax_log, sarimax_log.kpi, kpis, models)
params = {'seasonal_periodicity': 21}
data = df_avg.set_index('date')['close_volume']
log_data = np.log(data).diff(1).dropna()
sarimax_log_diff_1 = Sarimax(log_data, true_data=data, is_log_data=True, is_diff_data=True, data_diff_order=1, params=params)
sarimax_log_diff_1.fit()
sarimax_log_diff_1.predict_evaluate()
add_model('Sarimax log diff_1', sarimax_log_diff_1, sarimax_log_diff_1.kpi, kpis, models)
Let's plot close_volume predictions of these 3 SARIMAX models and compare them to our top 5 models discovered so far:
# let's plot in & out of sample predictions for the top 5 models
# note: we will zoom on the first 40 out-sample predictions, you can pan right or left to visualize more out-sample predictions or in-sample ones.
# note: special day will be marked with a black circle.
# note: the separation between in & out of sample data is marked with a vertical black line.
top_5_model_names = list(kpis.sort_values('test_mae', ascending=True).index[:5])
top_5_model_names += ['Sarimax baseline', 'Sarimax log', 'Sarimax log diff_1']
test_start_date = dataset['x'].index[dataset['x_train'].shape[0]]
init_end_date = dataset['x'].index[dataset['x_train'].shape[0]+40]
x_range_init = (pd.to_datetime(test_start_date), pd.to_datetime(init_end_date))
chart_predictions(dataset['x'], dataset['y'], dataset['x_train'], dataset['y_train'], dataset['x_test'], dataset['y_test'], models, top_5_model_names, x_range_init=x_range_init)
kpis.sort_values(['test_mae'])
We can conclude that SARIMAX doesn't fare well on this dataset compared to the other regressors.
Sarimax log diff_1 performs a bit better than its baseline and log peers, but not by much.
models['Sarimax log diff_1'][0].model.plot_diagnostics(figsize=(10, 6))
plt.show()
Interpretation of the plot diagnostics:
Standardized residuals: The residual errors seem to fluctuate around a mean of zero and have a uniform variance.
Histogram plus estimated density: It suggests that the standardized residuals don't quite follow a normal distribution with mean zero (though this model comes closer than the other two SARIMAX models, thanks to the log transform and first-order differencing applied to the data).
Normal Q-Q: It shows clearly that our residual distribution is skewed.
Correlogram: It shows that the residual errors are somewhat autocorrelated, which implies there is some pattern in the residuals that the model does not explain. We should therefore look for more predictors to add to the model via the exogenous variables (not covered in this project).
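The correlogram's hint of residual autocorrelation can be quantified with a Ljung-Box test. Below is a self-contained sketch of the statistic using NumPy and SciPy on synthetic residuals; in the notebook one would test the fitted model's residuals instead:

```python
import numpy as np
from scipy.stats import chi2

def ljung_box(resid, lags=10):
    """Ljung-Box test: Q = n(n+2) * sum_k r_k^2 / (n-k), p-value from chi2(lags)."""
    resid = np.asarray(resid, dtype=float)
    n = len(resid)
    r = resid - resid.mean()
    denom = np.sum(r ** 2)
    q = 0.0
    for k in range(1, lags + 1):
        r_k = np.sum(r[k:] * r[:-k]) / denom  # sample autocorrelation at lag k
        q += r_k ** 2 / (n - k)
    q *= n * (n + 2)
    p_value = chi2.sf(q, df=lags)
    return q, p_value

rng = np.random.default_rng(0)
noise = rng.normal(size=500)
resid = noise.copy()
resid[1:] += 0.4 * noise[:-1]  # inject MA(1) structure so some autocorrelation remains

q, p = ljung_box(resid)
# a p-value below 0.05 flags autocorrelation the model failed to capture
```

Rejecting the null here is the numerical counterpart of the bars sticking out of the confidence band in the correlogram panel.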
close_pct
# establishing a baseline model
models = models_init()
test_size = 0.3
target_col_name = 'close_pct'
features = ['special_day', 'weekday', 'date_type_monthly_last_business_day', 'date_type_msci_quarterly_review', 'date_type_msci_semi_annual_review', 'date_type_quarterly_expiry']
features_shifts = [0, 0, 0, 0, 0, 0]
with_normalization = True
kpis, dataset = run_pipeline(df_avg, models, test_size, features, features_shifts, target_col_name, with_normalization_=with_normalization)
show_pipeline_results(dataset['features'], kpis)
# let's plot in & out of sample predictions for the top 5 models
# note: we zoom in on the first 40 out-of-sample predictions; you can pan right or left to visualize more out-of-sample or in-sample predictions.
# note: special days are marked with a black circle.
# note: the separation between in-sample and out-of-sample data is marked with a vertical black line.
top_5_model_names = list(kpis.sort_values('test_r2', ascending=False).index[:5])
test_start_date = dataset['x'].index[dataset['x_train'].shape[0]]
init_end_date = dataset['x'].index[dataset['x_train'].shape[0]+40]
x_range_init = (pd.to_datetime(test_start_date), pd.to_datetime(init_end_date))
chart_predictions(dataset['x'], dataset['y'], dataset['x_train'], dataset['y_train'], dataset['x_test'], dataset['y_test'], models, top_5_model_names, x_range_init=x_range_init)
# using more available features to improve our baseline model
models = models_init()
test_size = 0.3
target_col_name = 'close_pct'
features = ['special_day', 'weekday', 'daily_volume', 'daily_volume_mean_20d', 'close_volume_mean_20d', 'close_pct_mean_20d', 'date_type_monthly_last_business_day', 'date_type_msci_quarterly_review', 'date_type_msci_semi_annual_review', 'date_type_quarterly_expiry']
features_shifts = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
with_normalization = True
kpis, dataset = run_pipeline(df_avg, models, test_size, features, features_shifts, target_col_name, with_normalization_=with_normalization, show_log_=False)
show_pipeline_results(dataset['features'], kpis)
So, once again, adding features pertaining to daily volume and the rolling means of daily volume, close volume and close pct significantly improved our results.
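Note that these rolling-mean features are shifted by one day so that only information available before each prediction date is used. They can be derived along these lines (a sketch with pandas; the column names follow the notebook, but the dataframe here is synthetic):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(
    {'daily_volume': rng.integers(1_000, 5_000, size=60).astype(float)},
    index=pd.bdate_range('2019-01-01', periods=60))

# 20-day rolling mean, shifted by one day so day t only sees data up to t-1
df['daily_volume_mean_20d'] = df['daily_volume'].rolling(20).mean().shift(1)

# the first 20 rows lack a full shifted window and are NaN
assert df['daily_volume_mean_20d'].isna().sum() == 20
```

Without the `.shift(1)` the feature would leak the current day's volume into the prediction, inflating in-sample and out-of-sample scores alike.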
# let's plot in & out of sample predictions for the top 5 models
# note: we zoom in on the first 40 out-of-sample predictions; you can pan right or left to visualize more out-of-sample or in-sample predictions.
# note: special days are marked with a black circle.
# note: the separation between in-sample and out-of-sample data is marked with a vertical black line.
top_5_model_names = list(kpis.sort_values('test_r2', ascending=False).index[:5])
test_start_date = dataset['x'].index[dataset['x_train'].shape[0]]
init_end_date = dataset['x'].index[dataset['x_train'].shape[0]+40]
x_range_init = (pd.to_datetime(test_start_date), pd.to_datetime(init_end_date))
chart_predictions(dataset['x'], dataset['y'], dataset['x_train'], dataset['y_train'], dataset['x_test'], dataset['y_test'], models, top_5_model_names, x_range_init=x_range_init)
Let's try to improve on the previous experiment by adding even more features to our dataset.
We will be using the same derived features as in section 6.2.3, where we were predicting close_volume.
# from df_avg let's create our advanced dataframe
df_avg_adv = create_advanced_df(df_avg)
df_avg_adv.head()
# using more available features to improve our baseline model
models = models_init()
test_size = 0.3
target_col_name = 'close_pct'
features = ['special_day', 'weekday', 'day', 'week', 'month', 'daily_volume', 'daily_volume_mean_20d', 'close_volume_mean_20d', 'close_pct_mean_20d', 'date_type_monthly_last_business_day',
'date_type_msci_quarterly_review', 'date_type_msci_semi_annual_review', 'date_type_quarterly_expiry',
'daily_volume_sd_rmean5_p1', 'close_volume_sd_rmean5_p1', 'close_pct_sd_rmean5_p1', 'daily_volume_sd_p1', 'close_volume_sd_p1',
'close_pct_sd_p1', 'daily_volume_sd_p2', 'close_volume_sd_p2', 'close_pct_sd_p2', 'daily_volume_sd_p3', 'close_volume_sd_p3', 'close_pct_sd_p3',
'close_volume_sd_lbd_rmean5_p1', 'close_volume_sd_qr_rmean5_p1', 'close_volume_sd_sar_rmean5_p1', 'close_volume_sd_qe_rmean5_p1'
]
features_shifts = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
with_normalization = True
kpis, dataset = run_pipeline(df_avg_adv, models, test_size, features, features_shifts, target_col_name, with_normalization_=with_normalization, show_log_=False)
show_pipeline_results(dataset['features'], kpis)
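The features_shifts list pairs each feature with a lag. A hypothetical helper applying such per-feature shifts (an illustration of the idea, not the notebook's actual run_pipeline internals) might look like:

```python
import pandas as pd

def apply_feature_shifts(df, features, shifts):
    """Shift each feature column by its paired lag, then drop rows made incomplete."""
    out = pd.DataFrame(index=df.index)
    for col, s in zip(features, shifts):
        out[col] = df[col].shift(s)
    return out.dropna()

df = pd.DataFrame({'special_day': [0, 1, 0, 0, 1],
                   'daily_volume': [100.0, 200.0, 150.0, 300.0, 250.0]})
x = apply_feature_shifts(df, ['special_day', 'daily_volume'], [0, 1])
# each row now holds today's special_day and the previous day's daily_volume
```

A shift of 1 is used for the volume-based features because the current day's volumes are not known at prediction time, while calendar features (special_day, weekday, the date_type flags) are known in advance and keep a shift of 0.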
# let's plot in & out of sample predictions for the top 5 models
# note: we zoom in on the first 40 out-of-sample predictions; you can pan right or left to visualize more out-of-sample or in-sample predictions.
# note: special days are marked with a black circle.
# note: the separation between in-sample and out-of-sample data is marked with a vertical black line.
top_5_model_names = list(kpis.sort_values('test_r2', ascending=False).index[:5])
test_start_date = dataset['x'].index[dataset['x_train'].shape[0]]
init_end_date = dataset['x'].index[dataset['x_train'].shape[0]+40]
x_range_init = (pd.to_datetime(test_start_date), pd.to_datetime(init_end_date))
chart_predictions(dataset['x'], dataset['y'], dataset['x_train'], dataset['y_train'], dataset['x_test'], dataset['y_test'], models, top_5_model_names, x_range_init=x_range_init)
Adding the derived features did improve our out-of-sample results.
Data question: First, let's answer our initial data question: Can we predict future close volumes from historical volumes and other factors?
Yes, we managed to predict close volumes and close volume percentages with fairly low error from historical volumes, special events, and features derived from these two core feature classes!
Best model: In both prediction exercises (close volumes and close volume percentages) the same model came out on top: Linear Regression.
For our most advanced experiment (with the derived features added) it reached an out-of-sample r^2 above 0.68, a significant improvement compared to the baseline models.
ARIMA: Despite several attempts at making ARIMA-based models perform better, I was unsuccessful in bringing one towards the top of the model ranking table.
Looking at the Mean Absolute Error, the best ARIMA model was still a good 30% worse than the best-performing model.
What I learnt: Data preparation took much, much longer than I would have ever anticipated. We read and hear about this quite often, but I definitely experienced it first-hand in this small project!
Code factorization was a great time saver, allowing me to scale up the number of experiments I could carry out in the limited time I had to complete this project; it is definitely a practice I will keep using in other projects.
Going further: I would have loved to have time to do 2 things:
Averaging, boosting and stacking: ensemble techniques which essentially use several models to produce a better prediction than any of the individual models alone.
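As a pointer for that future work, averaging and stacking can be sketched with scikit-learn (a minimal illustration on synthetic data, not on this project's dataset):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=6, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

base = [('lr', LinearRegression()),
        ('tree', DecisionTreeRegressor(max_depth=4, random_state=0))]

# averaging: the base models' predictions are simply averaged
avg = VotingRegressor(estimators=base).fit(X_train, y_train)

# stacking: a final regressor learns how best to combine the base predictions
stack = StackingRegressor(estimators=base,
                          final_estimator=LinearRegression()).fit(X_train, y_train)

avg_r2 = avg.score(X_test, y_test)
stack_r2 = stack.score(X_test, y_test)
```

In this project the same base estimators fitted by run_pipeline could be wrapped this way, with the ensemble's test_r2 and test_mae added to the kpis table alongside the individual models.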